New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GCS-Based actor management implementation #6763
Conversation
Can one of the admins verify this patch? |
Test FAILed. |
Test FAILed. |
9e82612
to
979cec7
Compare
Test FAILed. |
Test FAILed. |
979cec7
to
bf7e2d8
Compare
Test FAILed. |
bf7e2d8
to
6debd96
Compare
Test PASSed. |
1ccf692
to
4642605
Compare
Test PASSed. |
Test PASSed. |
4642605
to
2b8101e
Compare
Test PASSed. |
c3b1100
to
289feb7
Compare
Test FAILed. |
Test FAILed. |
Test PASSed. |
Test FAILed. |
289feb7
to
f428d4d
Compare
Test FAILed. |
(please feel free to re-request once not WIP) |
Sure, I'm not attached to the testing method that I suggested, and thanks for correcting me on the "bugs" I thought were there :) The main thing I care about is making sure that the tests are reliable and deterministic, as you say. So I think your suggestions about having the retry function be virtual would work as well. Could you add this to the current PR and remove the dependency on the io_context? I think this is important enough that we should have it in the initial commit for actor management. Thanks! |
Okay, after some discussion offline, let's get this PR merged without the changes to the unit tests, but the next focus should be on removing the io_context dependency from the unit tests. By the way, the Travis failures on OSX and RLlib seem like they're related to this PR. Let me know if I can help with debugging. |
Test PASSed. |
33f4625
to
98184b7
Compare
Test PASSed. |
OneNodeTest() : GcsActorSchedulerTest(1) {} | ||
}; | ||
TEST_F(GcsActorSchedulerTest, TestScheduleFailedWithZeroNode) { | ||
ASSERT_EQ(0, gcs_node_manager_->GetAllAliveNodes().size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually the actual value goes first and the expected value goes second. Not a big deal, but it makes the output nicer if the values do not match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this looks great! I just left two last comments, about notifying the TaskManager when the creation task finishes and about whether we need to flush the actor's leased worker address.
58cc70c
to
57b91a8
Compare
57b91a8
to
adaea18
Compare
Test PASSed. |
Test PASSed. |
Test PASSed. |
Test PASSed. |
b9c0278
to
a639ffa
Compare
Test PASSed. |
Test PASSed. |
Test PASSed. |
Test PASSed. |
Test PASSed. |
Why are these changes needed?
Pls see the <Design Document> first.
This PR implements the creation and reconstruction of actors based on gcs server.
Changes on gcs server side
Several important classes are added:
GcsActor
,GcsActorManager
,GcsActorScheduler
.ActorTableData
and provides some simple interface to access the field insideActorTableData
.GcsActorManager
, it also contains a inner class calledGcsLeasedWorker
which is an abstraction of remote leased worker in raylet.In addition, this PR has also made some changes to
GcsNodeManager
, it is responsible for monitoring and manage nodes.Changes on raylet side
Chages on worker side
Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.